Cross-Instance Tuning of Unsupervised Document Clustering Algorithms

نویسندگان

  • Damianos Karakos
  • Jason Eisner
  • Sanjeev Khudanpur
  • Carey E. Priebe
چکیده

In unsupervised learning, where no training takes place, one simply hopes that the unsupervised learner will work well on any unlabeled test collection. However, when the variability in the data is large, such hope may be unrealistic; a tuning of the unsupervised algorithm may then be necessary in order to perform well on new test collections. In this paper, we show how to perform such a tuning in the context of unsupervised document clustering, by (i) introducing a degree of freedom, α, into two leading informationtheoretic clustering algorithms, through the use of generalized mutual information quantities; and (ii) selecting the value of α based on clusterings of similar, but supervised document collections (crossinstance tuning). One option is to perform a tuning that directly minimizes the error on the supervised data sets; another option is to use “strapping” (Eisner and Karakos, 2005), which builds a classifier that learns to distinguish good from bad clusterings, and then selects the α with the best predicted clustering on the test set. Experiments from the “20 Newsgroups” corpus show that, although both techniques improve the performance of the baseline algorithms, “strapping” is clearly a better choice for cross-instance tuning. ∗This work was partially supported by the DARPA GALE program (Contract No ̄ HR0011-06-2-0001) and by the JHU WSE/APL Partnership Fund.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

خوشه‌بندی اسناد مبتنی بر آنتولوژی و رویکرد فازی

Data mining, also known as knowledge discovery in database, is the process to discover unknown knowledge from a large amount of data. Text mining is to apply data mining techniques to extract knowledge from unstructured text. Text clustering is one of important techniques of text mining, which is the unsupervised classification of similar documents into different groups. The most important step...

متن کامل

On the Comparison of Semi-Supervised Hierarchical Clustering Algorithms in Text Mining Tasks

Semi-supervised clustering approaches have emerged as an option for enhancing clustering results. These algorithms use external information to guide the clustering process. In particular, semi-supervised hierarchical clustering approaches have been explored in many fields in the last years. These algorithms provide efficient and personalized hierarchical overviews of datasets. To the best of th...

متن کامل

High-Dimensional Unsupervised Active Learning Method

In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...

متن کامل

Optimization of sediment rating curve coefficients using evolutionary algorithms and unsupervised artificial neural network

Sediment rating curve (SRC) is a conventional and a common regression model in estimating suspended sediment load (SSL) of flow discharge. However, in most cases the data log-transformation in SRC models causing a bias which underestimates SSL prediction. In this study, using the daily stream flow and suspended sediment load data from Shalman hydrometric station on Shalmanroud River, Guilan Pro...

متن کامل

Proposing a Novel Cost Sensitive Imbalanced Classification Method based on Hybrid of New Fuzzy Cost Assigning Approaches, Fuzzy Clustering and Evolutionary Algorithms

In this paper, a new hybrid methodology is introduced to design a cost-sensitive fuzzy rule-based classification system. A novel cost metric is proposed based on the combination of three different concepts: Entropy, Gini index and DKM criterion. In order to calculate the effective cost of patterns, a hybrid of fuzzy c-means clustering and particle swarm optimization algorithm is utilized. This ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007